{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d180958e-8362-4453-ad21-78ec618bc624",
   "metadata": {},
   "source": [
    "# OneHot Encoding in Scikit-Learn with DataFrames of Mixed Column Types"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11bdfc69-a04f-462e-8c14-c0b66dfd1796",
   "metadata": {},
   "source": [
    "## Some Toydata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec0cb03a-9a9a-4cad-9716-40fc29641f9a",
   "metadata": {},
   "source": [
    "- Imagine we have some dataset that consists of both numerical and categorical features.\n",
    "- And we just want to convert the categorical features into a onehot encoding (while leaving the numerical features untouched)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "61f31b73-d486-4bc5-876a-86636c1acb86",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "02e244fd-76b0-430f-a002-4291d7d687e3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numerical</th>\n",
       "      <th>categorical</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.1</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.1</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.1</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.2</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8.1</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1.2</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2.1</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>3.1</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>4.1</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    numerical categorical\n",
       "0         1.1           b\n",
       "1         2.1           b\n",
       "2         3.1           b\n",
       "3         4.2           b\n",
       "4         5.1           a\n",
       "5         6.1           a\n",
       "6         7.1           a\n",
       "7         8.1           a\n",
       "8         1.2           c\n",
       "9         2.1           c\n",
       "10        3.1           c\n",
       "11        4.1           c"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature_1 = [\n",
    "    1.1, 2.1, 3.1, 4.2,\n",
    "    5.1, 6.1, 7.1, 8.1,\n",
    "    1.2, 2.1, 3.1, 4.1\n",
    "]\n",
    "\n",
    "feature_2 = [\n",
    "    'b', 'b', 'b', 'b',\n",
    "    'a', 'a', 'a', 'a',\n",
    "    'c', 'c', 'c', 'c'\n",
    "]\n",
    "\n",
    "df = pd.DataFrame({'numerical': feature_1, 'categorical': feature_2})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8989975-b534-4c3f-bb4c-cbed1b7acaa5",
   "metadata": {},
   "source": [
    "## Onehot Encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26d8a4c3-a5b9-449e-ab27-421b93f95d9b",
   "metadata": {},
   "source": [
    "- We can use e.g., scikit-learn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to expand the categorical column into onehot-encoded ones\n",
    "- By default, the `OneHotEncoder` will expand all columns into categorical ones (this includes the numerical ones), which is not what we want if we have mixed-type datasets\n",
    "- We can use the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to select specific columns we want to transform, though"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "90ed36b9-a326-42dc-b7e4-aae1c4f95e50",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1. , 0. , 1.1],\n",
       "       [1. , 0. , 2.1],\n",
       "       [1. , 0. , 3.1],\n",
       "       [1. , 0. , 4.2],\n",
       "       [0. , 0. , 5.1],\n",
       "       [0. , 0. , 6.1],\n",
       "       [0. , 0. , 7.1],\n",
       "       [0. , 0. , 8.1],\n",
       "       [0. , 1. , 1.2],\n",
       "       [0. , 1. , 2.1],\n",
       "       [0. , 1. , 3.1],\n",
       "       [0. , 1. , 4.1]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import sklearn\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "\n",
    "\n",
    "ohe = OneHotEncoder(sparse=False, drop='first', dtype='float')\n",
    "\n",
    "\n",
    "categorical_features = ['categorical']\n",
    "\n",
    "col_transformer = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('cat', ohe, categorical_features)],\n",
    "         # include the numerical column(s) via passthrough:\n",
    "         remainder='passthrough' \n",
    ")\n",
    "\n",
    "col_transformer.fit(df)\n",
    "X_t = col_transformer.transform(df)\n",
    "X_t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f9110d53-c0af-4929-ad53-a54a7459dbc3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pandas : 1.4.0\n",
      "sklearn: 1.0.2\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark --iversions"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 64-bit ('base': conda)",
   "language": "python",
   "name": "python392jvsc74a57bd0249cfc85c6a0073df6bca89c83e3180d730f84f7e1f446fbe710b75104ecfa4f"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
